This is Hacker Public Radio episode 4,084 for Thursday, the 28th of March 2024. Today's show is entitled "Cloud Learning". It is hosted by Daniel Persson and is about 10 minutes long. It carries a clean flag. The summary is: my experience trying to train a model online.

Hello hackers, and welcome to another episode. Daniel here, and today I'm going to talk about cloud learning: taking a machine learning model and training it in the cloud. This was a topic I went into over the Christmas break because I was fed up with Advent of Code. I couldn't really bear doing more of that, so I needed another topic to look into. So I said, okay, I have this model, whatever it is; in this case it was a TTS model. I wanted to train a voice to speak a particular language and create something that I could use later on. So I needed to find somewhere in the cloud to train this model. I had figured out that training the model on my own computer would take about eight days to run through a full training cycle. But looking online, I could find places where I could train it in eight hours, or ten hours, twelve hours and so on, depending on which graphics cards I was using and how many of them I was running. So I wanted to try running this in the cloud.
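To put those numbers in perspective, here is a quick back-of-the-envelope comparison. The hourly rate is a made-up placeholder, not any vendor's real price:

```python
# Rough comparison: a full training cycle locally vs. on a rented cloud GPU.
# The $2/hour rate below is an invented placeholder for illustration only.

def cloud_cost(hours: float, rate_per_hour: float, gpus: int = 1) -> float:
    """Total price of renting `gpus` cards for `hours` at `rate_per_hour` each."""
    return hours * rate_per_hour * gpus

local_days = 8        # full training cycle on my own machine
cloud_hours = 10      # same cycle on a rented GPU (8-12 h depending on the card)

speedup = (local_days * 24) / cloud_hours
print(f"speedup: {speedup:.1f}x")                              # 19.2x
print(f"cost at $2/h, one GPU: ${cloud_cost(cloud_hours, 2.0):.2f}")
```

The point is simply that even a modest hourly rate can be worth paying when it turns a week-long run into an overnight one.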
So I talked to different cloud vendors. I reached out to Microsoft and looked at their solution. Sadly, after two hours of watching videos, trying to learn their platform, how to set things up, where to go and what tools to use, I gave up, because I still think Microsoft's way of structuring things is not intuitive to me. It's very confusing, so I couldn't really get into it, and I didn't want to spend more time on it. Spending two hours just trying to figure out which tools to use is not fun for me.

So I also tried Google. It was really easy to find my way there, and I figured out that I wanted to use Vertex AI. What I wanted to use was the model that I already had; I wanted to train that. A lot of these cloud providers give you notebooks that you are supposed to put your model into and run it there. But this model was so complex that I needed to check out a Git repository or run a Docker image. In Vertex AI, though, you can run your own Docker images and connect them to Cloud Storage, which was not that complicated, actually. You can do that pretty simply, and I set up something that could train. Then I wanted to run it on some GPU power, and there was the problem, because Google doesn't give you graphics cards unless you ask for them.
So you need to sign up and ask for a graphics card. On Christmas Day, I asked for graphics cards: one card of one type and four cards of an older type. And it took about three to four weeks until they actually gave me access to one of these cards. I haven't been able to run any jobs on that card yet; I'm still trying to figure that out. But just asking for a card and having it take that much time was not a great experience. They said it should take about two to three business days, not four weeks. But still, I got access.

Then I went over and looked at Amazon, because of course I wanted to try all of the big ones. And Amazon, frankly, just said no: you will not get any GPU power in our tooling, you need to use SageMaker. SageMaker is pretty much "use a notebook and train on GPUs in SageMaker". You still need to ask for GPUs, so I could have been declined there as well, but you can't run your custom images. So what I wanted to do, I was not allowed to do. They just said no, which was really sad, so I couldn't use their service either.

Then I found a service that was pay-as-you-go, pretty much GPU as a service. And I'm going to release a video about them.
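As a rough sketch of what that Vertex AI setup looks like from the command line: submitting a custom training job that runs your own Docker image on a GPU. The project, region, image URI and accelerator type below are placeholders for your own setup, and exact flag spellings may vary between gcloud versions:

```shell
# Submit a Vertex AI custom training job backed by our own container image.
# Region, machine type, accelerator and image URI are all example values.
gcloud ai custom-jobs create \
  --region=europe-west4 \
  --display-name=tts-training \
  --worker-pool-spec=machine-type=n1-standard-8,replica-count=1,accelerator-type=NVIDIA_TESLA_T4,accelerator-count=1,container-image-uri=europe-west4-docker.pkg.dev/my-project/tts/train:latest
```

The job reads and writes its data via Cloud Storage; the catch, as described above, is that the accelerator quota for the region has to be granted before a job like this will schedule.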
I'm not sure if I can actually talk about the company yet, but if you're following my channel, you will see that video eventually when we have released it. So I'm talking and working with them around this video. It's a very early company, an early concept, so they haven't really released everything yet. But I think their way of doing it is really interesting, and the right way to do it.

They have two or three different concepts, pretty much. They have an environment, where you say which data center you want to work in; for instance, you can say that you want to run in Norway. Then, when you have set up that environment, you can create volumes where you put your data. So I created a volume of 100 gigabytes where I put all my data, my operating system and so on. And then you can start virtual machines. I started a virtual machine with just CPU power and my volume, and installed all the different dependencies I needed using CPU power, which is very cheap. So it was a cheap way of getting my dependencies, my model and all my data set up and ready to do some training. Then I shut down that CPU-powered machine and took a machine where they had different numbers of graphics cards.
They were running A4000s, A5000s, A6000s, A100s, H100s and L40s. So they had a bunch of these kinds of cards, and machines with one card, two cards, four cards and eight cards. Some of the machines were super powerful, in order to train a lot. Of course it becomes more expensive to run with a lot of cards, but it was still affordable, I would say. So if you're running a load and you really want it done quickly, you just put more GPUs on it, start it up with the volume you have prepared, run your workload, and then shut it down. Perhaps you start a CPU machine again afterwards, in order to download the result. And the best thing about this is that you can either run things directly in the Linux environment of the virtual machine, or go into a VNC host and run your things there, or give the machine its own IP on the internet and log in using SSH, so you have full access to the machine. You can do whatever you want on these machines, and just use GPUs. So it was GPU as a service in its purest form, and I really liked that approach to training things online.

I haven't found any other service that does it similarly and as well. There is of course the option of running it at Linode, which Akamai has bought up now. They have similar solutions, but they're very expensive, from what I've found so far. You could also run things at DigitalOcean; I looked at them, and they are very interesting.
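Since the vendor is unnamed and their tooling is not fully released, the workflow above can only be sketched with an invented pseudo-CLI. Everything here, the `gpucloud` command, its subcommands and flags, is hypothetical and exists purely to illustrate the volume/VM separation described:

```shell
# Hypothetical sketch of the workflow: "gpucloud" and all of its
# subcommands are invented for illustration, not a real tool.

gpucloud volume create --env norway --name tts-data --size 100GB

# Cheap CPU-only VM to install dependencies and stage data on the volume.
gpucloud vm start --env norway --volume tts-data --cpu-only
ssh user@VM_IP 'pip install -r requirements.txt'   # prepare the environment
gpucloud vm stop --env norway

# Swap the same volume onto a GPU machine only for the expensive part.
gpucloud vm start --env norway --volume tts-data --gpus 4
ssh user@VM_IP 'python train.py'                   # run the training load
gpucloud vm stop --env norway                      # stop paying for GPUs
```

The design point is that the persistent volume decouples setup (cheap CPU time) from training (expensive GPU time), so the GPUs are only billed while they are actually working.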
When you start there, you first need to give them five bucks just to sign up, and once you have signed up you can get access to actually run things. But you have to ask for machine power again, and they only accepted running notebooks. So I had to pay them five bucks, and when it actually came down to it, I couldn't use them. So yeah, I will never see those five bucks again, because I can't use them. They said, okay, pay us five bucks and you can use that for your training later on; but because I will never train with them, I pretty much just gave them five bucks.

So this is what I have experienced trying to train machine learning models online using GPUs. Have you tried to do this, and did you have a different experience? Perhaps you have tried Microsoft and found it very easy and could give me some hints; perhaps record an episode explaining how to do it, so I can figure it out myself as well. Or if you have any other experience, please share it with the rest of the community. I'm very interested in this topic. I hope that you liked this episode, and I hope to see you in the next one.

You have been listening to Hacker Public Radio at hackerpublicradio.org. Today's show was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, click on our contribute link to find out how easy it really is. Hosting for HPR has been kindly provided by an honesthost.com, the Internet Archive and rsync.net.
Unless otherwise stated, today's show is released under a Creative Commons Attribution 4.0 International license.